Abstract

The trimming scheme with a prefixed cut-off proportion is known to improve the robustness of statistical models such as multivariate Gaussian mixture models (MGMMs) in small-scale tests by alleviating the impact of outliers. However, when this method is applied to real-world data, such as noisy speech, it is hard to know the optimal cut-off proportion for removing the outliers, and the scheme sometimes removes useful data samples as well. In this paper, we propose a new method that avoids this problem by measuring the dispersion degree (DD) of the training data, so as to realise automatic robust estimation for MGMMs. The DD model is studied using two different measures. For each one, we theoretically prove that the DD of the data samples in the context of MGMMs approximately obeys a specific (chi or chi-square) distribution. The proposed method is evaluated on a real-world application, a moderately-sized speaker recognition task. Experiments show that the proposed method can significantly improve the robustness of the conventional training method of GMMs for speaker recognition.
Introduction

Statistical models, such as Gaussian Mixture Models (GMMs) [17] and Hidden Markov Models (HMMs) [15], are important techniques in many signal processing domains, including acoustical noise reduction [20], image recognition [8] and speech/speaker recognition [15, 19, 25]. In this paper, we study a robust modelling issue regarding GMMs. This issue is important, since GMMs are often used as fundamental components for building more complicated models, such as HMMs. Thus, the method studied in this paper will be useful for other models as well.

The standard training method for GMMs is Maximum Likelihood Estimation (MLE) [2] based on the Expectation Maximisation (EM) algorithm [4]. Though it has proved effective, this method still lacks robustness in its training process. For instance, it is not robust against gross outliers and cannot compensate for the impact of the outliers contained in a training corpus. As is well known, outliers are common in a training population, because clean data are often either contaminated by noise or interfered with by objects other than the claimed data. Outliers in the training population may pull the parameters of the trained models to inappropriate locations [3] and can therefore break the models down and result in poor performance of a recognition system. In order to solve this problem, an impartial trimming scheme was introduced by Cuesta-Albertos et al. [3] to improve the robustness of statistical models such as K-means and GMMs. In their method, a prefixed proportion of data samples is removed from the training corpus and the rest of the data are used for model training. This method has been found robust against gross outliers when it is applied to small-scale data examples.

However, when this method is used in real-world applications, such as noisy acoustic signal processing, it is hard to know the optimal cut-off proportion that should be used, since one does not know what fraction of the data should be taken away from the overall training population. A removal proportion that is too large removes too much useful information, whereas one that is too small has no effect on removing outliers. To attack this issue, we propose a new method that uses a dispersion degree model with two different distance metrics to identify the outliers automatically.

The contributions of our proposed method are threefold. First, we suggest using the trimmed K-means algorithm in place of the conventional K-means approach to initialise the parameters of GMMs. It is shown in this paper that appropriate initial values for model parameters are crucial for the robust training of GMMs. Second, we propose to use the dispersion degree of the training data samples as a selection criterion for automatic outlier removal. We theoretically prove that the dispersion degree approximately obeys a certain distribution, depending on the measure it uses. We refer to this method hereafter as automatic robust estimation with a trimming scheme (ARE-TRIM) for Gaussian mixture models.

Third, we evaluate the proposed method on a real-world application, a moderately-sized speaker recognition task. The experimental results show that the proposed method can significantly improve the conventional training algorithm for GMMs by making it more robust against gross outliers.

The rest of this paper is organised as follows. In Section 2, we present the framework of our proposed ARE-TRIM training algorithm for GMMs. In Section 3, we present the trimmed K-means clustering algorithm and compare it with the conventional K-means. In Section 4, we propose the dispersion degree model based on two distance metrics and use it for ARE-TRIM. We carry out experiments to evaluate ARE-TRIM in Section 5, and finally we conclude this paper with our findings in Section 6.

Framework of Automatic Trimming Algorithm for Gaussian Mixture Models

The proposed ARE-TRIM algorithm essentially consists of several modifications to the conventional training method of GMMs. The conventional GMM training algorithm normally consists of two steps: the K-means clustering algorithm for model parameter initialisation (footnote 1) and the EM algorithm for fine training of the parameters, as illustrated in Fig. 1. It is well known that appropriate initial values for model parameters crucially affect the final performance of GMMs [2]. Inappropriate initial values could cause the fine training with EM at the second step to be trapped at a local optimum, which is often not globally optimal. Therefore, it is interesting to study more robust clustering algorithms to help train better model parameters.

Footnote 1: Some other model initialisation methods exist, such as mixture splitting, but initialisation with K-means is comparable in performance to the others and also more popular, so we stick to K-means.

FIG. 1 THE CONVENTIONAL TRAINING ALGORITHM FOR GMMS

ARE-TRIM modifies the first stage of the conventional GMM training algorithm, i.e., the model parameter initialisation, and leaves the fine training step with EM untouched. The modification consists of two steps: (1) replacing the conventional K-means clustering algorithm with the more robust trimmed K-means clustering algorithm (Section 3 gives more details on why the trimmed K-means clustering algorithm is more robust than the conventional one); (2) applying a dispersion degree model to identify the outliers automatically and then trim them off from the training population. In this procedure, no extra step is required to govern or monitor the training process, and the overall procedure is automatic and robust; it is therefore referred to as automatic robust estimation (ARE). The overall ARE-TRIM procedure is illustrated in Fig. 2.

In the next section, we shall explain why the trimmed K-means clustering algorithm is more robust than the conventional K-means clustering method and present the theorems and properties related to the trimmed K-means clustering algorithm.
FIG. 2 THE ARE-TRIM TRAINING SCHEME FOR GMMS

Trimmed K-means clustering algorithm

At the first step, ARE-TRIM replaces the conventional K-means clustering algorithm with the trimmed K-means clustering algorithm to initialise the model parameters of GMMs. The robustness of the trimmed K-means clustering algorithm is essential to the robustness of ARE-TRIM. Hence, in this section, we present some properties of the trimmed K-means clustering algorithm and clarify why it is more robust against gross outliers than the conventional K-means clustering algorithm.

Algorithm Overview 

The trimmed K-means clustering algorithm extends the conventional K-means clustering algorithm with an impartial trimming scheme [3] that removes part of the data from the training population. It is based on the concept of trimmed sets. Let $D = \{x_t,\ t \in [1, T]\}$ denote a given training data set. A trimmed set $D_\alpha$ is then defined as a subset of $D$ obtained by trimming off a fraction $\alpha$ of the data from the full data set $D$, where $\alpha \in [0, 1]$. The trimmed K-means clustering algorithm can be defined as follows: the objects $x \in D_\alpha$ are partitioned into $K$ clusters, $C = \{C_k\},\ k = 1, \ldots, K$, based on the criterion of minimising the intra-class covariance (MIC) $V$, i.e.,

$$C^* = \arg\min_{C} V(D_\alpha \mid C) = \arg\min_{C} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2, \qquad (1)$$

where $\mu_k$ is the centroid or mean point of the cluster $C_k$, which includes all the points $x \in C_k$ in the trimmed set, i.e., $x \in D_\alpha$, and $V(D_\alpha \mid C)$ is the intra-class covariance of the trimmed data set $D_\alpha$ with respect to the cluster set $C$, i.e.,

$$V(D_\alpha \mid C) = \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2. \qquad (2)$$

The difference between the trimmed K-means and the conventional K-means is that the former uses a trimmed set for training, whereas the latter uses the full training population.

A solution to the above optimisation problem can be sought iteratively by using a modified Lloyd's algorithm, as follows:

Algorithm 1 (Trimmed K-means clustering algorithm):

1. Initialise the centroids $\mu_k^{(0)}$ of the $K$ clusters.

2. Detect outliers with respect to the current centroids $\mu_k^{(n)}$, following a principle that will be described in Section 4. This yields the trimmed training set $D_\alpha^{(n)}$ for the $n$-th iteration.

3. Cluster all the samples in the $n$-th trimmed training set, i.e., $x_t \in D_\alpha^{(n)}$, into $K$ classes according to the following principle:

$$i = \arg\min_{k} \|x_t - \mu_k^{(n)}\|^2, \quad \forall x_t \in D_\alpha^{(n)}. \qquad (3)$$

4. Re-calculate the centroids of the $K$ clusters according to the rule

$$\mu_k^{(n+1)} = \frac{1}{|C_k|} \sum_{x_t \in C_k} x_t. \qquad (4)$$

5. Check the intra-class covariance $V$ as in eq. (2) to see if it is minimised (footnote 2). If yes, stop; otherwise, go to step 2.
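
The iteration above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: since the outlier-detection principle of step 2 is only introduced in Section 4, the sketch substitutes the simplest impartial-trimming rule (discard the fixed fraction `alpha` of samples farthest from their nearest centroid). The function names `sq_dist` and `trimmed_kmeans`, the `tol` stopping threshold and the `max_iter` cap are hypothetical choices made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def trimmed_kmeans(data, init_centroids, alpha, max_iter=100, tol=1e-9):
    """Lloyd-style trimmed K-means with a fixed trimming fraction (assumption)."""
    centroids = [list(c) for c in init_centroids]       # step 1: initial centroids
    n_keep = len(data) - int(alpha * len(data))         # size of the trimmed set
    prev_v = float("inf")
    kept = []
    for _ in range(max_iter):
        # Steps 2-3: nearest centroid per sample, then keep the n_keep closest.
        scored = []
        for x in data:
            dist, j = min((sq_dist(x, c), j) for j, c in enumerate(centroids))
            scored.append((dist, j, x))
        scored.sort(key=lambda item: item[0])
        kept = scored[:n_keep]
        # Step 4: recompute each centroid as the mean of its kept members.
        for j in range(len(centroids)):
            members = [x for _, jj, x in kept if jj == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Step 5: stop when the intra-class covariance V no longer changes.
        v = sum(sq_dist(x, centroids[j]) for _, j, x in kept)
        if abs(prev_v - v) < tol:
            break
        prev_v = v
    return centroids, [x for _, _, x in kept]
```

Calling `trimmed_kmeans(data, init, alpha=0.0)` reduces to conventional K-means on the full set, which makes it easy to compare the two behaviours on data containing a gross outlier.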

The existence and consistency of the trimmed K-means have been proved by Cuesta-Albertos et al. in [3]: given a $p$-dimensional random variable space $X$, a trimming factor $\alpha$, where $\alpha \in [0, 1]$, and a continuous nondecreasing metric function, there exists a trimmed K-means for $X$ [3]. Furthermore, it is also proved that for a $p$-dimensional random variable space with probability measure $P(X)$, any sequence of the empirical trimmed K-means converges in probability to the unique trimmed K-means [3].

Why Is Trimmed K-means More Robust than Conventional K-means?

The robustness of the trimmed K-means algorithm has been analysed theoretically in [9] using three measures: the Influence Function (IF), the Breakdown Point (BP) and Qualitative Robustness (QR). The results in [9] show that the trimmed K-means is theoretically more robust than the conventional K-means. Next, we briefly review these results.

The IF is a metric that measures the robustness of a statistical estimator by providing rich quantitative and graphical information [11]. Let $x$ be a random variable and $\delta_x$ be the probability measure that gives mass 1 to $x$. Then the IF of a functional or an estimator $T$ at a distribution $F$ is defined as the directional derivative of $T$ at $F$ in the direction of $\delta_x$:

$$IF(x; T, F) = \lim_{t \to 0^+} \frac{T((1 - t)F + t\delta_x) - T(F)}{t} \qquad (5)$$

for those $x$ for which this limit exists. From the definition, we can see that the IF is a derivative of $T$ at the distribution $F$ in the direction of $\delta_x$, as the derivative is calculated by moving a small amount of the mass of $F$ towards $\delta_x$. Therefore, the IF provides only a local description of the behaviour of an estimator at a probability model, and it must always be complemented with a measure of the global reliability of the functional on the neighbourhood of the model in order to obtain an accurate view of the robustness of an estimator $T$.

Footnote 2: In implementation, the minimum value of the intra-class covariance can be detected by many methods, one of which is to check whether the change of the intra-class covariance $V$ as in eq. (2) between two successive iterations is less than a given threshold.

Such a complementary measure is the so-called breakdown point (BP) [5], which measures how far from the model the good properties derived from the IF of the estimator can be expected to extend [9]. The BP is the smallest fraction of corrupted observations needed to break down an estimator $T$, denoted $\varepsilon_n(T, X)$, for a given data set $X$.

Besides the above two measures, qualitative robustness (QR) is another way to measure the robustness of an estimator. QR, proposed by Hampel [10], is defined via an equicontinuity condition, i.e., given a functional $T$ and a sequence of estimators $\{T_n\}_{n=1}^{\infty}$, we say $T$ is continuous if $T_n \to T(F)$ as $n \to \infty$ at a distribution $F$.

By using these three measures, we can compare the 
robustness between the trimmed K-means and the 
traditional K-means. The principal conclusions, 
credited to [9], can be summarised as follows: 

1. IF: for the K-means, the IF is bounded only for bounded penalty (error) functions used in the clustering procedure, whereas for the trimmed K-means it is bounded for a wider range of functions, and in practice the IF vanishes outside each cluster.

2. BP: the smallest fraction of corrupted observations needed to break down the K-means estimator, $\varepsilon_n(T, X)$, is $1/n$ for a given data set $X$ with $n$ points, whereas for the trimmed K-means it could be near to 1 over the trimming size for well-clustered data. Thus, the BP of the trimmed K-means is much larger.

3. QR: there is no QR for the K-means in the case of $K \ge 2$, while QR exists for every $K \ge 1$ for the trimmed K-means.
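
The breakdown-point contrast in item 2 can be seen numerically in the simplest setting, $K = 1$ in one dimension, where the K-means centroid is just the sample mean. The sketch below is an illustrative assumption rather than material from [9]: `trimmed_one_mean` is a hypothetical helper that starts from the median and repeatedly keeps the closest $(1 - \alpha)$ fraction of points, and `shift_under_contamination` measures how far one added outlier moves an estimator.

```python
from statistics import mean, median


def trimmed_one_mean(xs, alpha, n_iter=20):
    """1-D trimmed 1-means: iteratively average the points closest to the centre."""
    n_keep = len(xs) - int(alpha * len(xs))
    centre = median(xs)                      # robust starting point (assumption)
    for _ in range(n_iter):
        kept = sorted(xs, key=lambda x: abs(x - centre))[:n_keep]
        centre = mean(kept)
    return centre


def shift_under_contamination(estimator, clean, outlier):
    """Absolute change of the estimate when a single outlier is appended."""
    return abs(estimator(clean + [outlier]) - estimator(clean))
```

With `clean = [0.0, 1.0, ..., 9.0]`, the shift of `mean` grows without bound as the outlier moves away, since a single corrupted point out of $n$ suffices, whereas the shift of `lambda xs: trimmed_one_mean(xs, 0.1)` stays bounded because the outlier is trimmed before averaging.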

The theory of robust statistics has thus laid a solid theoretical foundation for the robustness of the trimmed K-means, which in turn provides strong support for the advantages of the proposed ARE-TRIM over the conventional GMM training algorithm.

The robustness of the trimmed K-means is illustrated in Fig. 3, where two clusters are assumed to represent two groups of data, $A$ and $B$. Most of the data in $A$ and $B$ are grouped around their clusters $C_1$ and $C_2$, except for an outlier $A'$ of the cluster $C_1$. The outlier point $A'$ is referred to as a breakdown point, i.e., a BP. In this case, the classic K-means breaks down due to the existence of the BP $A'$, and two incorrect clusters are generated instead. For the trimmed K-means, however, the algorithm does not break down given an appropriate trimming value, and the two correct clusters $C_1$ and $C_2$ can still be found. This illustrates the robustness of the trimmed K-means.




FIG. 3 ILLUSTRATION OF THE ROBUSTNESS OF THE TRIMMED K-MEANS CLUSTERING

Modelling data dispersion degree for automatic robust estimation

In previous work [3], a fixed fraction of the data is trimmed off from the full training set in order to train the trimmed K-means. However, in real applications, e.g., acoustic noise reduction and speaker recognition tasks, a fixed cut-off strategy may not successfully trim off all the essential outliers, as the data do not usually contain a fixed number of outliers. In this paper, we propose to solve this issue by using a model of the data dispersion degree.
We start our definition of the data dispersion degree from the simplest case, i.e., the dispersion degree of a data point with respect to a cluster. For a cluster $c$ with a mean vector $\mu$ and an inverse covariance matrix $\Sigma^{-1}$, the dispersion degree $d(x \| c)$ of a given data point $x$, i.e., the degree to which the data point deviates from the cluster $c$, is defined as a certain distance

$$d(x \| c) = \mathrm{distance}(x \| c). \qquad (6)$$

A variety of distances can be adopted for this metric. For instance, the Euclidean distance can be used to represent the dispersion degree between a data point $x$ and a cluster $c$ with mean $\mu$, i.e.,

$$d(x \| c) = \|x - \mu\| = \sqrt{\sum_{i=1}^{d} (x_i - \mu_i)^2}, \qquad (7)$$

where $d$ is the dimension of the vector $x$.

Apart from this, the Mahalanobis distance (M-distance) can also be used. With this metric, the dispersion degree of a data point $x$ with respect to the class $c = \{\mu, \Sigma^{-1}\}$ is defined as follows:

$$d(x \| c) = (x - \mu)^T \Sigma^{-1} (x - \mu). \qquad (8)$$

The concept of the dispersion degree of a data point with respect to one cluster can easily be generalised to the multi-class case. For a multi-class set $C = \{c_k\},\ k \in [1, K]$, the dispersion degree $d(x \| C)$ of a data point $x$ with respect to $C$ is defined as the dispersion degree of $x$ with respect to the optimal class to which the data point belongs in the sense of the Bayesian decision rule, i.e.,

$$d(x \| C) = d(x \| c_j), \qquad (9)$$

where

$$j = \arg\max_{k} \Pr(x \mid c_k) \qquad (10)$$

and $\Pr(x \mid c_k)$ is the conditional probability of the data point $x$ being generated by $c_k$. In statistical modelling theory, the conditional probability $\Pr(x \mid c_k)$ often takes the form of a normal distribution with a mean vector $\mu_k$ and an inverse covariance matrix $\Sigma_k^{-1}$, e.g.,

$$\Pr(x \mid c_k) = N(x \mid \mu_k, \Sigma_k^{-1}). \qquad (11)$$

Based on the above definitions, we can further define the dispersion degree of a data set $X = \{x_t\},\ t \in [1, T]$, with respect to a multi-class set $C = \{c_k\},\ k \in [1, K]$, as the sequence of the dispersion degrees of each data point $x_t$ with respect to $C$, i.e.,

$$d(X \| C) = \{d(x_t \| C)\},\quad t \in [1, T].$$
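
The definitions in eqs. (6) to (11) translate directly into code. The following sketch assumes diagonal covariances, with each class given as a hypothetical `(mean, var)` pair of per-dimension means and variances; the function names are illustrative and do not come from the paper. Class assignment follows eq. (10), i.e., the class with the largest Gaussian likelihood, without mixture weights.

```python
import math


def euclidean_dd(x, mean):
    """Eq. (7): Euclidean distance between a point and a class mean."""
    return math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, mean)))


def mahalanobis_dd(x, mean, var):
    """Eq. (8) for a diagonal covariance: sum of squared, variance-scaled offsets."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))


def log_gaussian(x, mean, var):
    """Log-density of a diagonal-covariance normal, as in eq. (11)."""
    return -0.5 * sum(
        math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi
        for xi, mi, vi in zip(x, mean, var)
    )


def dispersion_degree(x, classes, metric="euclidean"):
    """Eqs. (9)-(10): dispersion w.r.t. the most likely class."""
    j = max(range(len(classes)), key=lambda k: log_gaussian(x, *classes[k]))
    mean, var = classes[j]
    if metric == "euclidean":
        return euclidean_dd(x, mean)
    return mahalanobis_dd(x, mean, var)


def dataset_dispersion(X, classes, metric="euclidean"):
    """d(X || C): the sequence of per-sample dispersion degrees."""
    return [dispersion_degree(x, classes, metric) for x in X]
```

Working with log-densities instead of densities avoids numerical underflow in high dimensions and does not change the arg max.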

Given a data set $X$, if we use a random variable $y$ to represent the dispersion degree of $X$ with respect to a multi-class set $C$, then we can prove that, under a given distance metric, the random variable $y$ approximately obeys a certain distribution when $T$, the size of the data set $X$, approaches infinity, i.e., $T \to +\infty$. Our first theorem addresses the case of the Euclidean distance.

Theorem 4.1 Let the random variable $X = \{x_1, x_2, \ldots, x_T\}$ be a set of data points in a $d$-dimensional space that can be optimally modelled by a GMM $G$, i.e.,

$$X \sim G = \sum_{k=1}^{K} w_k N(\mu_k, \Sigma_k^{-1}), \qquad (12)$$

where $K$ is the number of Gaussian components of the GMM $G$, and $w_k$, $\mu_k$ and $\Sigma_k^{-1}$ are the weight, mean and diagonal inverse covariance matrix of the $k$-th Gaussian, with $\sigma_{ki}^2$, $1 \le i \le d$, on the diagonal of $\Sigma_k^{-1}$.

Let $C = \{c_k\},\ k \in [1, K]$, denote a multi-class set, where each element $c_k$ represents a Gaussian component $N(\mu_k, \Sigma_k^{-1})$ of the GMM $G$. Each sample $x_t$ of the data set $X$ can be assigned to an optimal class $c_j$ under the Bayesian rule, i.e., by selecting the maximum conditional probability of the data point $x_t$ given a class $c_k$, which is the $k$-th Gaussian of the model $G$:

$$j = \arg\max_{k=1,\ldots,K} \Pr(x_t \mid c_k) = \arg\max_{k=1,\ldots,K} N(x_t \mid \mu_k, \Sigma_k^{-1}). \qquad (13)$$

Let us also assume a random variable $y$ that represents the dispersion degree of the data set $X$ with respect to the multi-class set $C$, i.e., the collection of the dispersion degrees of the samples in $X$ derived from the continuous Euclidean distance as defined in eq. (7):

$$y = \{\|x_t^{(k)} - \mu_k\|\},\quad k \in [1, K] \text{ and } t \in [1, T]. \qquad (14)$$

Then the distribution of the random variable $y$ is approximately a chi distribution $\chi(x; v)$ with $v$ degrees of freedom [1], i.e.,
$$y \sim \chi(x; v) = \frac{2^{1 - v/2}\, x^{v-1} e^{-x^2/2}}{\Gamma(v/2)}, \qquad (15)$$

where

$$v = \sum_{k=1}^{K} \frac{\left(\sum_{i=1}^{d} \sigma_{ki}^4\right)^3}{\left(\sum_{j=1}^{d} \sigma_{kj}^6\right)^2} \qquad (16)$$

and $\Gamma(z)$ is the Gamma function,

$$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\,dt. \qquad (17)$$

Proof: By applying the decision rule based on the optimal conditional probability of a data point $x_t$ given a class $c_k$, we can partition the overall data set $X$ into disjoint subsets, i.e.,

$$X = \{x^{(1)}, x^{(2)}, \ldots, x^{(K)}\}. \qquad (18)$$

According to the standard training of GMMs, e.g., ML training, any data point $x_t$ is assumed to be modelled by a normal component of the GMM; thus, we naturally have $x^{(k)} \sim N(\mu_k, \Sigma_k^{-1})$ in the sense of the Bayesian decision rule. According to eq. (7), we know that

$$y_{(k)}^2 = \|x^{(k)} - \mu_k\|^2 = \sum_{i=1}^{d} \sigma_{ki}^2 \left(\frac{x_i^{(k)} - \mu_{ki}}{\sigma_{ki}}\right)^2 = \sum_{i=1}^{d} \sigma_{ki}^2\, \chi^2(1), \qquad (19)$$

thus $y_{(k)}^2$ is in fact a linear combination of chi-square distributions $\chi^2(1)$ with 1 degree of freedom.

According to [13], it is easy to verify that $y_{(k)}^2$ approximately obeys a chi-square distribution $\chi^2(v_k)$, where

$$v_k = \frac{\left(\sum_{i=1}^{d} \sigma_{ki}^4\right)^3}{\left(\sum_{i=1}^{d} \sigma_{ki}^6\right)^2} \qquad (20)$$

(see Appendix for detailed derivation).
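
As a quick sanity check on eq. (20), offered here as an illustrative special case rather than as part of the original derivation, consider unit variances. The sum in eq. (19) is then an unweighted sum of $d$ independent $\chi^2(1)$ variables (independence follows from the diagonal covariance), which is exactly $\chi^2(d)$, and the cubed-sum over squared-sum ratio in eq. (20) returns the same value:

```latex
\sigma_{ki} = 1 \quad (1 \le i \le d)
\;\Longrightarrow\;
y_{(k)}^2 = \sum_{i=1}^{d} \chi^2(1) = \chi^2(d),
\qquad
v_k = \frac{\left(\sum_{i=1}^{d} 1\right)^{3}}{\left(\sum_{i=1}^{d} 1\right)^{2}}
    = \frac{d^{3}}{d^{2}} = d .
```

In this case the approximation is exact, which is the behaviour one would expect from a moment-matching formula.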

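At this point, we have a sequence of variables $y_{(k)}^2$, each of which approximately follows a $\chi^2(v_k)$ distribution with $v_k$ degrees of freedom, one for each partition $x^{(k)}$ of the data $X$.

If we define a new variable $y$, the square of which is expressed as

$$y^2 = \frac{1}{K} \sum_{k=1}^{K} y_{(k)}^2, \qquad (21)$$

then we can prove that $y^2$ approximately obeys another chi-square distribution $\chi^2(v)$ with $v$ degrees of freedom, where

$$v = \sum_{k=1}^{K} \frac{\left(\sum_{i=1}^{d} \sigma_{ki}^4\right)^3}{\left(\sum_{j=1}^{d} \sigma_{kj}^6\right)^2}. \qquad (22)$$

The proof is straightforward by noticing that $y^2$ is in fact a linear combination of the chi-square distributions $\chi^2(v_k)$. Thus, we can prove this theorem by using Theorem 6.1. For this, we have the following parameters:

$$\lambda_k = \frac{1}{K}, \qquad (23)$$

$$\delta_k = 0, \qquad (24)$$

$$h_k = v_k = \frac{\left(\sum_{i=1}^{d} \sigma_{ki}^4\right)^3}{\left(\sum_{i=1}^{d} \sigma_{ki}^6\right)^2}, \qquad (25)$$

$$c_j = \sum_{k=1}^{K} \lambda_k^j h_k = \frac{1}{K^j} \sum_{k=1}^{K} v_k, \qquad (26)$$

$$s_1 = \frac{c_3}{c_2^{3/2}}, \qquad (27)$$

$$s_2 = \frac{c_4}{c_2^2}. \qquad (28)$$

Because

$$\frac{s_1^2}{s_2} = \frac{\dfrac{1}{K^6} P^2 \Big/ \dfrac{1}{K^6} P^3}{\dfrac{1}{K^4} P \Big/ \dfrac{1}{K^4} P^2} = 1, \qquad (29)$$

where

$$P = \sum_{k=1}^{K} v_k, \qquad (30)$$

we have $s_1^2 = s_2$. Therefore, $y^2$ is approximately a central chi-square distribution $\chi^2(v, \delta)$ with

$$\delta = 0 \qquad (31)$$

and

$$v = \frac{c_2^3}{c_3^2} = \frac{1}{s_1^2} = \frac{P^3 / K^6}{P^2 / K^6} = P. \qquad (32)$$

Further, we naturally obtain that $y \sim \chi(v)$, which completes the proof.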

Remarks: Theorem 4.1 shows that the dispersion degree of a data set under the Euclidean distance approximately obeys a chi distribution. However, when the number of degrees of freedom $v$ is large, a $\chi$ distribution can be approximated by a normal distribution $N(\mu, \sigma^2)$. This result is especially useful in real applications, where $v$ is very large, because it is easy to compute. To this end, the following theorem holds [2]:

Theorem 4.2 A chi distribution $\chi(v)$ as in eq. (15) can be accurately approximated by a normal distribution $N(\mu, \sigma^2)$ for large $v$, with $\mu = \sqrt{v - 1}$ and $\sigma^2 = 1/2$ [2].

Proof: The proof is straightforward by using the Laplace approximation [2, 21]. According to the Laplace approximation, a given pdf $p(x)$ with log-pdf $f(x) = \ln p(x)$ can be approximated by a normal distribution $N(x_{\max}, \sigma^2)$, where $x_{\max}$ is a local maximum of the log-pdf $f(x)$ and $\sigma^2 = -1 / f''(x_{\max})$. So the proof amounts to finding $x_{\max}$ and $\sigma^2$. For this, we can derive

$$f(x) = \ln p(x) = \ln x^{v-1} + \ln e^{-x^2/2} \qquad (33)$$

by using the definition of the chi distribution as in eq. (15) and ignoring the constants $\Gamma(v/2)$ and $2^{1 - v/2}$.

By setting the first-order derivative of $f(x)$ to zero, we get

$$\frac{df(x)}{dx} = \frac{v - 1}{x} - x = 0.$$

Hence, $x_{\max} = \sqrt{v - 1}$ and

$$\sigma^2 = -\frac{1}{f''(x)\big|_{x_{\max}}} = -\frac{1}{\dfrac{d}{dx}\left(\dfrac{v - 1}{x} - x\right)\Big|_{x_{\max}}} \qquad (34)$$

$$= \frac{1}{2}. \qquad (35)$$

The derivation is done.
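
The quality of this Laplace approximation can be checked numerically. The sketch below uses only the Python standard library; `chi_pdf` implements the density of eq. (15) in log-space, and `max_abs_gap` is a hypothetical helper that scans a grid around the mode and reports the largest pointwise difference between the chi density and the approximating normal $N(\sqrt{v - 1}, 1/2)$. The grid width and resolution are arbitrary choices for illustration.

```python
import math


def chi_pdf(x, v):
    """Chi density of eq. (15), evaluated in log-space for numerical stability."""
    log_p = ((1 - v / 2) * math.log(2) + (v - 1) * math.log(x)
             - x * x / 2 - math.lgamma(v / 2))
    return math.exp(log_p)


def normal_pdf(x, mu, var):
    """Density of N(mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def laplace_params(v):
    """Mean and variance given by Theorem 4.2."""
    return math.sqrt(v - 1), 0.5


def max_abs_gap(v, half_width=3.0, steps=600):
    """Largest |chi - normal| density gap on a grid around the mode."""
    mu, var = laplace_params(v)
    lo = max(mu - half_width, 1e-6)
    hi = mu + half_width
    gap = 0.0
    for i in range(steps + 1):
        x = lo + i * (hi - lo) / steps
        gap = max(gap, abs(chi_pdf(x, v) - normal_pdf(x, mu, var)))
    return gap
```

Comparing `max_abs_gap` for a small and a large $v$ gives a quick feel for how large $v$ needs to be before the normal surrogate is acceptable in a given application.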



Next, we present the other main result of dispersion degree modelling, which concerns the Mahalanobis distance.

Theorem 4.3 Let the random variable $X = \{x_1, x_2, \ldots, x_T\}$ be a set of data points in a $d$-dimensional space that can be optimally modelled by a GMM $G$, i.e.,

$$X \sim G = \sum_{k=1}^{K} w_k N(\mu_k, \Sigma_k^{-1}), \qquad (36)$$

where $K$ is the number of Gaussian components of the GMM $G$, and $w_k$, $\mu_k$ and $\Sigma_k^{-1}$ are the weight, mean and diagonal inverse covariance matrix of the $k$-th Gaussian, with $\sigma_{ki}^2$, $1 \le i \le d$, on the diagonal of $\Sigma_k^{-1}$.

Let $C = \{c_k\},\ k \in [1, K]$, denote a multi-class set, where each element $c_k$ represents a Gaussian component $N(\mu_k, \Sigma_k^{-1})$ of the GMM $G$. Each sample $x_t$ of the data set $X$ can be assigned to an optimal class $c_j$ under the Bayesian rule, i.e., by selecting the maximum conditional probability of the data point $x_t$ given a class $c_k$, which is the $k$-th Gaussian of the model $G$:

$$j = \arg\max_{k=1,\ldots,K} \Pr(x_t \mid c_k) = \arg\max_{k=1,\ldots,K} N(x_t \mid \mu_k, \Sigma_k^{-1}). \qquad (37)$$

Let us also assume a random variable $y$ that represents the dispersion degree of the data set $X$ with respect to the multi-class set $C$, i.e., the collection of the dispersion degrees of the samples in $X$ derived from the continuous Mahalanobis distance as defined in eq. (8):

$$y = \{(x_t^{(k)} - \mu_k)^T \Sigma_k^{-1} (x_t^{(k)} - \mu_k) : k \in [1, K],\ t \in [1, T]\}. \qquad (38)$$

Then the distribution of the random variable $y$ is approximately a chi-square distribution with $v$ degrees of freedom [1], i.e.,

$$y \sim \chi^2(x; v) = \frac{x^{v/2 - 1} e^{-x/2}}{2^{v/2}\, \Gamma(v/2)}, \qquad (39)$$

where

$$v = dK. \qquad (40)$$

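Proof: Following the same strategy, the data set $X$ can also be partitioned into $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(K)}\}$. By noting that

$$y^{(k)} = (x^{(k)} - \mu_k)^T \Sigma_k^{-1} (x^{(k)} - \mu_k),$$

we get $y^{(k)} \sim \chi^2(x^{(k)}; d)$. Further, let

$$y = \frac{1}{K} \sum_{k=1}^{K} y^{(k)}; \qquad (41)$$

then we can prove that

$$y \sim \chi^2(x; v), \qquad (42)$$

where $v$ is given by eq. (40).

The proof is a straightforward result of applying Theorem 6.1, as $y$ is in fact a linear combination of the $\chi^2$ distributions of $y^{(k)}$. For this, we can easily see that

$$\lambda_k = \frac{1}{K}, \qquad (43)$$

$$h_k = d, \qquad (44)$$

$$\delta_k = 0, \qquad (45)$$

$$c_j = \sum_{k=1}^{K} \lambda_k^j h_k = \frac{d}{K^{j-1}}, \qquad (46)$$

$$s_1^2 = \frac{c_3^2}{c_2^3} = \frac{1}{dK}, \qquad (47)$$

$$s_2 = \frac{c_4}{c_2^2} = \frac{1}{dK}. \qquad (48)$$

As $s_1^2 = s_2$, $y$ approximately obeys a central $\chi^2$ distribution $\chi^2(v, \delta)$ with $\delta = 0$ and

$$v = \frac{c_2^3}{c_3^2} = dK,$$

which completes the proof.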
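
The arithmetic behind $v = dK$ can be verified exactly with rational numbers. The sketch below assumes that the quantities of Theorem 6.1 (stated in the Appendix, not reproduced here) take the form $c_j = \sum_k \lambda_k^j h_k$ with all non-centrality parameters equal to zero; this form, the function names and the values $d = 19$, $K = 32$ are assumptions chosen purely for illustration.

```python
from fractions import Fraction


def c(lams, hs, j):
    """c_j = sum_k lambda_k**j * h_k (assumed form, zero non-centrality)."""
    return sum(lam ** j * h for lam, h in zip(lams, hs))


def approx_dof(lams, hs):
    """Degrees of freedom of the approximating chi-square: c_2**3 / c_3**2."""
    return c(lams, hs, 2) ** 3 / c(lams, hs, 3) ** 2


def s1_squared_and_s2(lams, hs):
    """The two quantities compared in the proof: s_1**2 and s_2."""
    c2, c3, c4 = (c(lams, hs, j) for j in (2, 3, 4))
    return c3 ** 2 / c2 ** 3, c4 / c2 ** 2


d, K = 19, 32                        # illustrative values only
lams = [Fraction(1, K)] * K          # lambda_k = 1/K
hs = [d] * K                         # h_k = d
```

With these inputs, `approx_dof(lams, hs)` equals `d * K`, and both values returned by `s1_squared_and_s2(lams, hs)` equal `Fraction(1, d * K)`, matching the condition $s_1^2 = s_2$ used in the proof.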



From these two theorems, we can see that the dispersion degree of a data set with respect to a multi-class set can be modelled by either a chi distribution or a chi-square distribution, depending on the distance measure applied. These are the theoretical results. In practice, the normal distribution is often used instead to approximate the chi and chi-square distributions, owing to its fast computation and convenient manipulation, especially when the number of degrees of freedom of the chi (chi-square) distribution is large, e.g., $v > 10$. For the $\chi$ distribution in eq. (15), we have seen that it can be approximated by using Theorem 4.2. For the $\chi^2$ distribution in eq. (39), we have the following result.

Theorem 4.4 A chi-square distribution $\chi^2(v)$ can be approximated by a normal distribution $N(v, 2v)$ for large $v$, where $v$ is the number of degrees of freedom of $\chi^2(v)$ [12].

Having established that the dispersion degree essentially obeys a certain distribution, we can apply this model to automatic outlier removal in ARE-TRIM by using the following definition:

Definition 4.1 Given a dispersion degree model $M$ for a data set $X$ and a threshold $\tau$, where $\tau > 0$, any data point $x$ is identified as an outlier at the threshold $\tau$ if the conditional cumulative probability $P(x \mid M)$ of the dispersion degree of the data point $x$, conditioned on the dispersion degree model $M$, is larger than the threshold $\tau$, i.e.,

$$P(x \mid M) = \int_{-\infty}^{d(x \| C)} p(y \mid M)\,dy > \tau. \qquad (49)$$

With Definition 4.1 and a proper value selected for the threshold $\tau$, the outliers can be automatically identified and thus removed by our proposed ARE-TRIM training scheme. The detailed algorithm is formulated as follows and includes two main parts, i.e., the estimation and identification processes:

Algorithm 2 (ARE-TRIM training algorithm):

1. Estimation of the model $M$: according to Theorems 4.1, 4.2, 4.3 and 4.4, $y \sim N(\theta)$, where $\theta = (\mu, \sigma^2)$. Then $\theta$ can be asymptotically estimated as $\hat{\theta} = (\hat{\mu}, \hat{\sigma}^2)$ by the well-known formulae

$$\hat{\mu} = \frac{1}{T} \sum_{t=1}^{T} y_t, \qquad (50)$$

$$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} (y_t - \hat{\mu})^2, \qquad (51)$$

where

$$y_t = d(x_t \| C). \qquad (52)$$

2. Outlier identification:

(a) For each data sample $x_t$ do:

i. Calculate $y_t$ according to eq. (52);

ii. Calculate the cumulative probability $P(x_t \mid M)$ according to eq. (49);

iii. If $P(x_t \mid M) > \tau$, then $x_t$ is identified as an outlier and is trimmed off from the training set; otherwise $x_t$ is not an outlier and is thus used for GMM training.

(b) Endfor
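
The two parts of Algorithm 2 can be sketched as follows, assuming the dispersion degrees $y_t$ of eq. (52) have already been computed and that the model $M$ is the normal surrogate suggested by Theorems 4.2 and 4.4. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the variance uses the $1/T$ normalisation, and the normal CDF is evaluated with `math.erf`.

```python
import math


def fit_normal(ys):
    """Step 1: estimate (mu, sigma^2) of the dispersion degrees, eqs. (50)-(51)."""
    T = len(ys)
    mu = sum(ys) / T
    var = sum((y - mu) ** 2 for y in ys) / T
    return mu, var


def normal_cdf(y, mu, var):
    """Cumulative probability of eq. (49) under the normal model M."""
    return 0.5 * (1.0 + math.erf((y - mu) / math.sqrt(2.0 * var)))


def are_trim_split(samples, ys, tau):
    """Step 2: split samples into (kept, outliers) at threshold tau."""
    mu, var = fit_normal(ys)
    if var == 0.0:                     # all dispersion degrees equal: nothing to trim
        return list(samples), []
    kept, outliers = [], []
    for x, y in zip(samples, ys):
        if normal_cdf(y, mu, var) > tau:
            outliers.append(x)         # P(x | M) > tau: trimmed off
        else:
            kept.append(x)             # retained for GMM training
    return kept, outliers
```

Because a cumulative probability never exceeds 1, calling `are_trim_split` with `tau=1.0` keeps every sample, and lowering `tau` trims progressively more of the samples with the largest dispersion degrees.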

Further Remarks: Note that both Theorem 4.1 and Theorem 4.3 rely on the Bayesian decision rule for the partition process. As is well known in the pattern recognition domain [6], the Bayesian decision rule is an optimal classifier in the sense of minimising the decision risk in a squared form [6]. Hence, the quality of the recognition process should satisfy the requirements of most pattern recognition applications. In the next section, we carry out experiments to show that such a procedure is effective.

Experiments 

In previous sections, we have discussed the ARE-TRIM algorithm from a theoretical viewpoint. In this section, we show its effectiveness by applying it to a real signal processing application.

Our proposed training approach aims at improving the robustness of the classic GMMs by adopting the automatic trimmed K-means training technique. Thus, theoretically speaking, any application using GMMs, such as speaker/speech recognition or acoustical noise reduction, can be used to evaluate the effectiveness of our proposed method. Without loss of generality, we select a moderately-sized speaker recognition task for evaluation, as the GMM is widely accepted as the most effective method for speaker recognition.

Speaker recognition has two common application tasks: speaker identification (SI) (recognising a speaker identity) and speaker verification (SV) (authenticating a registered valid speaker) [17]. Since the classic GMMs have been demonstrated to be very efficient for SI [19], we simply choose the SI task based on a telephony corpus, NTIMIT [7], for evaluation.

In NTIMIT, there are ten sentences (5 SX's, 2 SA's and 3 SI's) for each speaker. Similar to [16], we used six sentences (two SX, $SX_{1-2}$; two SA, $SA_{1-2}$; and two SI, $SI_{1-2}$) as the training set, two sentences ($SX_3$ and $SI_3$) as the development set and the last two SX utterances ($SX_{4-5}$) as the test set. The development set is used for fine tuning the relevant parameters for GMM training. With it, we select a variance threshold factor of 0.01 and a minimum Gaussian weight of 0.05 as optimum values for GMM training (performance falls sharply if either is halved or doubled).

As in [14, 17, 26-28], MFCC features, obtained using HTK [29], are used, with 20 ms windows and a 10 ms shift, a pre-emphasis factor of 0.97, a Hamming window and 20 Mel-scaled feature bands. All 20 MFCC coefficients are used except c0. On this database, neither cepstral mean subtraction nor time difference features increase performance, so these are not used. Apart from these, no extra processing is employed.

Also as in [16-18], GMMs with 32 Gaussians are trained to model each speaker for the SI task. All the Gaussians use diagonal covariance matrices, as it is well known in the speech domain that diagonal covariance matrices produce very similar results to full covariance matrices [15, 29]. Also, the standard MLE method [2] based on the EM algorithm [4] is used to train the GMMs, due to its efficiency and wide application in speech processing.

In the trimmed K-means step, random selection of K 
points is used to initialise the centroids of the K 
clusters. 
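
The following sketch illustrates a trimmed K-means loop with random initialisation. It is the conventional variant with a fixed proportion of retained points (`keep_fraction`), used here only as a stand-in: the automatic criterion of ARE-TRIM based on the dispersion degree and the threshold of Definition 4.1 is not reproduced. The function names and the optional `init` argument are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def trimmed_kmeans(points, k, keep_fraction, iterations=10, init=None, seed=0):
    """Lloyd-style K-means that ignores the farthest points in each update.

    keep_fraction is a fixed cut-off (an assumption of this sketch),
    not the automatic trimming rule proposed in the paper.
    """
    rng = random.Random(seed)
    # Random selection of K data points as initial centroids, unless given.
    start = init if init is not None else rng.sample(points, k)
    centroids = [list(p) for p in start]
    n_keep = max(k, int(round(keep_fraction * len(points))))
    kept = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid and record the distance.
        scored = []
        for p in points:
            c = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            scored.append((sq_dist(p, centroids[c]), c, p))
        # Trim: keep only the n_keep points closest to their centroids.
        scored.sort(key=lambda item: item[0])
        kept = scored[:n_keep]
        # Update each centroid from its surviving members only.
        for c in range(k):
            members = [p for _, ci, p in kept if ci == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [p for _, _, p in kept]
```

Note that with random initialisation an outlier can itself be drawn as a centroid, in which case it sits at distance zero and is never trimmed; this is one reason the result may depend on the initial draw.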

Experiments with Euclidean distance 

We first present the experiment testing the 
dispersion degree model with the Euclidean 
distance. In this experiment, different values of the 
threshold $\tau$ (Definition 4.1), from 1.0 down to 
0.6, are used to trim off the outliers in the 
training data. $\tau = 1.0$ means that no outlier is 
pruned, whereas $\tau = 0.6$ means that the maximum 
number of outliers is identified and trimmed off. The 
proposed method is then tested on both the development 
set and the test set. The development set is used to 
select the optimal value for the trimming threshold 
$\tau$. The results are presented in Fig. 4. 
[Figure: x-axis is the trimming threshold $\tau$]

FIG. 4 THE ARE-TRIM GMMS VS. THE CONVENTIONAL 
GMMS ON NTIMIT WITH THE EUCLIDEAN DISTANCE 

From Fig. 4, we can see that: (1) the proposed method 
does improve system performance on both the 
development and test sets with the threshold 
$\tau \in [0.6, 1.0)$. The accuracy of ARE-TRIM is higher 
than that of the conventional GMM training method for 
all threshold values on the development set and for 
most of them on the test set. This shows the 
effectiveness of the proposed method. (2) The value 
of the threshold $\tau$ cannot be too small; otherwise, 
too many meaningful points that are not actually 
outliers will be removed. This may result in 
unpredictable system performance (though most 
threshold values still improve system performance, as 
shown in Fig. 4). This can be seen in Fig. 5, where we 
give the averaged proportions of the trimmed samples 
corresponding to each value of the threshold $\tau$. 




[Figure: x-axis is the trimming threshold $\tau$]

FIG. 5 THE AVERAGED PROPORTIONS OF TRIMMING DATA 
REGARDING DIFFERENT THRESHOLD VALUES BASED ON 
THE EUCLIDEAN DISTANCE (EUCLIDEAN-DIST) AND THE 
MAHALANOBIS DISTANCE (M-DIST) 



The "averaged" proportions are obtained across a 
number of training speakers and a number of 
iterations. Before each iteration of the EM training, a 
trimmed K-means is applied using a trimmed set with 
the threshold $\tau$. From this figure, we can see that 
the averaged trimming proportions increase as the 
trimming threshold $\tau$ moves away from 1.0. 
(3) In practice, we suggest that the range [0.9, 1.0] 
be used to select an optimal value for $\tau$, as it is a 
reasonable probability range for identifying outliers. 
Outside this range, valid data are very likely to be 
trimmed off. This is also partly shown in Fig. 5, as 
0.9 roughly corresponds to 10% of the data being 
removed. (4) The trends of performance on the 
development and test sets show a similar increase 
before the peak values are obtained at 6% outlier 
removal, with improvements from 50.0% to 
53.40% on the development set and from 58.95% 
to 61.73% on the test set. After 6% outlier removal, 
system performance varies on both the development 
and test sets. This implies that, up to this point, the 
outliers in the training data have been effectively 
removed and more robust models are obtained. However, 
when more data are removed, with $\tau$ decreasing below 
0.96, useful data are removed as well. Hence, system 
performance exhibits a threshold-varying behaviour, 
depending on which part of the data is removed. 
Therefore, we suggest that in practice the range 
[0.90, 1.0) be used, from which a suitable value for the 
threshold $\tau$ is selected. In this experiment, we 
choose 0.96 as the optimum for $\tau$ based on the 
development set. 
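
The selection of the threshold on the development set amounts to a small grid search, which can be sketched as follows. The grid step of 0.02, the function name `select_threshold` and the callback `evaluate_dev` (which would train the trimmed GMMs for a given threshold and return the development-set accuracy) are assumptions made for illustration.

```python
def select_threshold(candidates, evaluate_dev):
    """Return the (threshold, accuracy) pair with the best development accuracy.

    evaluate_dev is a hypothetical callback mapping a threshold to an
    accuracy. Ties are resolved in favour of the earliest candidate.
    """
    best_tau, best_acc = None, float("-inf")
    for tau in candidates:
        acc = evaluate_dev(tau)
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau, best_acc

# Candidate thresholds in the suggested range, from 1.0 down to 0.90.
candidates = [round(1.0 - 0.02 * i, 2) for i in range(6)]
```

Listing the candidates from 1.0 downwards means that, on a tie, the threshold that trims the least data is preferred.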

Experiments with the Mahalanobis distance 

When we use the Euclidean distance to model the 
dispersion degree, we do not take into consideration 
the covariance of the data in a cluster, but only 
the distances to the centroid. However, the 
covariances of different clusters may be quite 
different, and this may largely affect the distribution 
of data dispersion degrees. Thus, in this experiment, 
we evaluate the use of the Mahalanobis distance for the 
automatic trimming measure in ARE-TRIM. 
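
The difference between the two measures can be illustrated with a small sketch. A diagonal covariance is assumed here for simplicity, in line with the diagonal-covariance GMMs used in the experiments; the function names are hypothetical.

```python
import math

def euclidean(x, centroid):
    """Euclidean distance from x to the cluster centroid."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, centroid)))

def mahalanobis_diag(x, centroid, variances):
    """Mahalanobis distance under an assumed diagonal covariance."""
    return math.sqrt(sum((xi - ci) ** 2 / v
                         for xi, ci, v in zip(x, centroid, variances)))

# Two points at the same Euclidean distance from the centroid.
centroid = (0.0, 0.0)
variances = (9.0, 1.0)   # the cluster is much wider along the first axis
a, b = (3.0, 0.0), (0.0, 3.0)
```

Both points lie at Euclidean distance 3 from the centroid, but their Mahalanobis distances are 1 and 3 respectively: the point along the high-variance axis is far less unusual for this cluster.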

From Fig. 6, we can see that, for both the development 
and test sets, the trimmed training scheme can 
significantly improve the robustness and system 
performance. The highest accuracy on the 
development set is improved from 50.0% to 
55.25% with the trimming threshold $\tau = 0.92$, at 
which the accuracy on the test set is improved from 
58.95% to 61.11%. This shows that the automatic 
trimming scheme, i.e., ARE-TRIM, is quite effective in 
improving the robustness of GMMs. 
FIG. 6 THE ARE-TRIM GMMS VS. THE CONVENTIONAL GMMS 
ON NTIMIT WITH THE MAHALANOBIS DISTANCE 

When the trimming threshold moves too far from 1.0, 
speaker identification accuracy shows a certain 
threshold-varying behaviour, similar to the trimming 
measure with the Euclidean distance, because in this 
case not only the outliers but also some meaningful 
data samples are trimmed off. Furthermore, by 
comparing Figs. 4 and 6, we find that the improvement 
on the development set obtained with the Mahalanobis 
distance (55.25%) is larger than that obtained with 
the Euclidean distance (53.39%). This may suggest that 
the Mahalanobis distance is a better metric for 
modelling dispersion degrees, because it takes the 
covariance into account. 

Conclusions 

In this paper, we have proposed an automatic 
trimming estimation algorithm for the conventional 
Gaussian mixture models. This trimming scheme 
consists of several novel contributions to improve the 
robustness of Gaussian mixture model training by 
effectively removing outlier interference. First of all, a 
modified Lloyd's algorithm is proposed to realise the 
trimmed K-means clustering algorithm and is used for 
parameter initialisation of Gaussian mixture models. 
Secondly, the data dispersion degree is proposed for 
automatically identifying outliers. Thirdly, 
we have theoretically proved that the data dispersion 
degree in the context of GMM training approximately 
obeys a certain distribution (a chi or chi-square 
distribution), depending on whether the Euclidean or 
the Mahalanobis distance is applied. Finally, the 
proposed training scheme has been evaluated on a 
realistic application with a medium-size speaker 
identification task. The experiments have shown that 
the proposed method can significantly improve the 
robustness of Gaussian mixture models. 

Appendix 

Theorem from [13] 

We shall first cite the result from [13] as Theorem 6.1, 
and then use it to derive the result used in the proof 
for Theorem 4.1. 

Theorem 6.1 Let $Q(X)$ be a weighted sum of non-central 
chi-square variables, i.e., 

$$Q(X) = \sum_{i} \lambda_i \, \chi^2_{h_i}(\delta_i), \tag{53}$$

where $h_i$ is the degrees of freedom and $\delta_i$ is the 
non-centrality parameter of the $i$-th $\chi^2$ distribution. 
Define the following parameters 

$$c_k = \sum_{i} \lambda_i^k h_i + k \sum_{i} \lambda_i^k \delta_i, \tag{54}$$

$$s_1 = c_3 / c_2^{3/2}, \tag{55}$$

$$s_2 = c_4 / c_2^{2}, \tag{56}$$

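then $Q(X)$ can be approximated by a chi-square 
distribution $\chi^2_l(\delta)$, where the degrees of freedom $l$ 
and the non-centrality $\delta$ are divided into two cases: 

$$l = \begin{cases} c_2^3 / c_3^2, & \text{if } s_1^2 < s_2, \\ a^2 - 2\delta, & \text{if } s_1^2 > s_2, \end{cases} \tag{57}$$

$$\delta = \begin{cases} 0, & \text{if } s_1^2 < s_2, \\ s_1 a^3 - a^2, & \text{if } s_1^2 > s_2, \end{cases} \tag{58}$$

and 

$$a = 1 / \left( s_1 - \sqrt{s_1^2 - s_2} \right). \tag{59}$$

Useful result for the proof of Theorem 4.1 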



Next, we shall use Theorem 6.1 to derive the 
distribution of $y^{(k)}$ in eq. (19) used in the proof of 
Theorem 4.1. For simplicity of presentation, we drop the 
superscript $(k)$, as it is clear from the context that 
this derivation concerns the $k$-th partition of the data 
set $X$. From eq. (19), we know that $y^2$ is a linear 
combination of one-dimensional $\chi^2$ distributions, i.e., 

$$y^2 = \sum_{i} \sigma_i^2 \, \chi^2(1). \tag{60}$$



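As $\chi^2(1)$ is central, we have $h_i = 1$, $\delta_i = 0$ and 
$\lambda_i = \sigma_i^2$ in our case. With these quantities, it is 
easy to see that 

$$c_k = \sum_{i} \sigma_i^{2k}, \tag{61}$$

$$s_1 = \frac{c_3}{c_2^{3/2}} = \frac{\sum_{i} \sigma_i^6}{\left(\sum_{i} \sigma_i^4\right)^{3/2}}, \tag{62}$$

$$s_2 = \frac{c_4}{c_2^{2}} = \frac{\sum_{i} \sigma_i^8}{\left(\sum_{i} \sigma_i^4\right)^{2}}. \tag{63}$$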



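From this we can know 

$$s_1^2 < s_2, \tag{64}$$

because 

$$\frac{s_1^2}{s_2} = \frac{\left(\sum_{i} \sigma_i^6\right)^2}{\sum_{i} \sigma_i^4 \sum_{i} \sigma_i^8} = \frac{\sum_{i} \sigma_i^{12} + \sum_{i} \sum_{j \neq i} \sigma_i^6 \sigma_j^6}{\sum_{i} \sigma_i^{12} + \sum_{i} \sum_{j \neq i} \sigma_i^4 \sigma_j^8} < 1, \tag{65}$$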



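and 

$$C = \sum_{i} \sum_{j \neq i} \sigma_i^6 \sigma_j^6 - \sum_{i} \sum_{j \neq i} \sigma_i^4 \sigma_j^8 = \sum_{i} \sum_{j \neq i} \sigma_i^4 \sigma_j^6 \left(\sigma_i^2 - \sigma_j^2\right). \tag{66}$$

By noticing that a term $A$, 

$$A = \sigma_i^4 \sigma_j^6 \left(\sigma_i^2 - \sigma_j^2\right), \tag{67}$$

always has a symmetric pair, a term $B$, i.e., 

$$B = \sigma_j^4 \sigma_i^6 \left(\sigma_j^2 - \sigma_i^2\right), \tag{68}$$

we can reorganise them as 

$$A + B = \sigma_i^4 \sigma_j^4 \left(\sigma_i^2 - \sigma_j^2\right)\left(\sigma_j^2 - \sigma_i^2\right) < 0. \tag{69}$$

Thus, we know $C < 0$ and further 

$$\frac{s_1^2}{s_2} < 1. \tag{70}$$

Therefore we have eq. (64). 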



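As we know $s_1^2 < s_2$, by using Theorem 6.1, we 
obtain that the approximating $\chi^2_l(\delta)$ is central with 
the parameters 

$$\delta = 0, \tag{71}$$

$$l = \frac{c_2^3}{c_3^2} = \frac{\left(\sum_{i} \sigma_i^4\right)^3}{\left(\sum_{i} \sigma_i^6\right)^2}. \tag{72}$$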
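
The parameter computation of Theorem 6.1 and its specialisation to the dispersion degree can be checked numerically with the sketch below. The function names are hypothetical, and treating the boundary case $s_1^2 = s_2$ (which arises when all $\sigma_i$ are equal) as central is an assumption of this sketch.

```python
import math

def chi_square_approx_params(lambdas, dofs, noncentralities):
    """Return (l, delta, s1, s2) for a weighted sum of chi-square variables.

    lambdas are the weights, dofs the degrees of freedom h_i and
    noncentralities the parameters delta_i, as in Theorem 6.1.
    """
    def c(k):
        return (sum(lam ** k * h for lam, h in zip(lambdas, dofs))
                + k * sum(lam ** k * d for lam, d in zip(lambdas, noncentralities)))

    c2, c3, c4 = c(2), c(3), c(4)
    s1 = c3 / c2 ** 1.5
    s2 = c4 / c2 ** 2
    if s1 ** 2 > s2:
        a = 1.0 / (s1 - math.sqrt(s1 ** 2 - s2))
        delta = s1 * a ** 3 - a ** 2
        dof = a ** 2 - 2.0 * delta
    else:
        # Central case; equality is included here by assumption.
        delta = 0.0
        dof = c2 ** 3 / c3 ** 2
    return dof, delta, s1, s2

def dispersion_params(sigmas):
    """Special case h_i = 1, delta_i = 0, lambda_i = sigma_i^2."""
    lambdas = [s ** 2 for s in sigmas]
    return chi_square_approx_params(lambdas, [1] * len(sigmas),
                                    [0.0] * len(sigmas))
```

For example, `dispersion_params([1.0, 2.0, 3.0])` gives $c_2 = 98$, $c_3 = 794$ and $c_4 = 6818$, so that $s_1^2 \approx 0.670 < s_2 \approx 0.710$, and the approximating distribution is central with $l = 98^3 / 794^2 \approx 1.49$ degrees of freedom.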